-
-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: apply to_tsvector
to the filtered columns when a fts
operator is used
#3845
base: main
Are you sure you want to change the base?
Conversation
b9cee18
to
b8f8db2
Compare
b8f8db2
to
5e61a05
Compare
CHANGELOG.md
Outdated
@@ -26,6 +26,7 @@ This project adheres to [Semantic Versioning](http://semver.org/). | |||
- #3795, Clarify `Accept: vnd.pgrst.object` error message - @steve-chavez | |||
- #3779, Always log the schema cache load time - @steve-chavez | |||
- #3706, Fix insert with `missing=default` uses default value of domain instead of column - @taimoorzaeem | |||
- #2255, Fix `to_tsvector()` not applying explicitly to the full-text search filtered column (on non `tsvector` types) - @laurenceisla |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely sure if this should be considered a new feature or a fix. I think the user expects the to_tsvector
to be applied under the hood, so I'm adding it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the issue says "enhancement", the PR title says "feat:" - and yet you conclude to add it to the "fixed" section :D
I'd look at this from the angle of backporting. I wouldn't backport it, at least in its current form. I understand you're applying this transformation to every column except those with type tsvector right now. This does have a considerably risk of breaking indexes, we only need to go back as far as when we did something similar with to_jsonb(...)
around everything to allow accessing fields or composite types or so. It did break indexes.
Now.. what if we only apply this transformation for json/jsonb types? That would be a lot less risky, in which case we could think about backporting.
Although... even in this case, I'd still argue that this is more of a feature than a fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the PR title says "feat:"
Oh woops, I left that from the previous iteration (adding the totsv
modifier looked more like a feat in that case for me). Yeah, I get it now.
I understand you're applying this transformation to every column except those with type tsvector right now. This does have a considerably risk of breaking indexes, we only need to go back as far as when we did something similar with
to_jsonb(...)
around everything to allow accessing fields or composite types or so
Yes, although there's a difference in that the to_tsvector()
is already applied implicitly to the text
types with the @@
operator.
text @@ tsquery → boolean
Does text string, after implicit invocation of to_tsvector(), match tsquery?
'fat cats ate rats' @@ to_tsquery('cat & rat') → t
The other types will still fail but with a different error message, e.g. function to_tsquery(integer) does not exist
instead of operator does not exist: integer @@ tsquery
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other types will still fail but with a different error message, e.g.
function to_tsquery(integer) does not exist
instead ofoperator does not exist: integer @@ tsquery
.
Which kind of brings me back to my original question: #2255 (comment)
Did anyone try creating the missing @@ operator for jsonb with a function that does this to_tsvector call explicitly?
Like.. do we actually need this fix feature, or could we just advise people to create the respective cast operator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does have a considerably risk of breaking indexes,
In fact, I think it's the contrary, actually. I've been checking more about FTS indexes and the docs say:
Because the two-argument version of to_tsvector was used in the index above, only a query reference that uses the 2-argument version of to_tsvector with the same configuration name will use that index. That is,
WHERE to_tsvector('english', body) @@ 'a & b'
can use the index, butWHERE to_tsvector(body) @@ 'a & b
cannot. This ensures that an index will be used only with the same configuration used to create the index entries.
As it is right now if we a GIN index to a column of text
type, PostgREST won't use it. This change (appying to all types, not only JSON) would fix that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fail to read your conclusion out of this.
However, I read something else:
Another approach is to create a separate tsvector column to hold the output of to_tsvector. To keep this column automatically up to date with its source data, use a stored generated column.
It seems to me that suggesting to our users to create generated columns with the tsvector output of the json column would solve the problem nicely. Then this generated column can be searched - without any additional operators or to_tsvector
wrappers. It also scales much better, because this solution will easily allow users to add those config arguments ("english" etc.) or create the tsvector column over multiple columns.
If instead, we implement as proposed here, I see the next feature request coming very soon: "I need to pass 'english' to the two-argument-version of tsvector, please let PostgREST do that for me."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like.. do we actually need this
fixfeature, or could we just advise people to create the respectivecastoperator?
It seems to me that suggesting to our users to create generated columns with the tsvector output of the json column would solve the problem nicely.
Yes, that'd be another solution. We'd need to mention that it won't work as expected for text
or json
columns and to use tsvector
generated columns and give the suggestions you list. Nonetheless, I think having an out of the box option to use would be a good experience.
If instead, we implement as proposed here, I see the next feature request coming very soon: "I need to pass 'english' to the two-argument-version of tsvector, please let PostgREST do that for me."
Do you mean to_tsvector(config_language, text)
? Cause... that's what it's implemented in this PR: col=fts(english).val
in SQL would be to_tsvector('english', col) @@ to_tsquery('english', 'val')
(I updated the original post of the PR to reflect this). I figured it made sense having the same config languages in the tsquery and the tsvector, so using the one in fts(language)
for both shouldn't be a probem.
All in all, I'm on the fence here. I see the value in both having the implementation out of the box and recommending generated columns
as an alternative and also for complex cases (computed fields are not included in the PR, for instance).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean
to_tsvector(config_language, text)
? Cause... that's what it's implemented in this PR:col=fts(english).val
in SQL would beto_tsvector('english', col) @@ to_tsquery('english', 'val')
(I updated the original post of the PR to reflect this). I figured it made sense having the same config languages in the tsquery and the tsvector, so using the one infts(language)
for both shouldn't be a probem.
Ah. I didn't look at the code, sorry. And apparently not at the PR description either :D.
OK, now I understand your comment about indexing better as well. And then I see why you'd want to apply this not only to json, but also to text, for consistency.
All in all, I'm on the fence here. I see the value in both having the implementation out of the box and recommending
generated columns
as an alternative and also for complex cases (computed fields are not included in the PR, for instance).
Seems like we should maybe do both. This PR and a better explanation of the generated column approach for special cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we should maybe do both. This PR and a better explanation of the generated column approach for special cases?
Agree, opened #3870 to track this. Meanwhile this PR should be ready for review/merging.
5e61a05
to
42b02b6
Compare
res = fromMaybe (unknownField fn jp) $ HM.lookup qi tables >>= | ||
Just . flip resolveTableFieldName fn | ||
Just . (\t -> resolveTableFieldName t fn toTsV) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a performance concern here: #2255 (comment)
I don't think there will be a difference in performance since it's using this lookup that is already applied for all filters before this PR.
42b02b6
to
846a45c
Compare
to_tsvector
to the filtered columns when a fts
operator is used
… column, only if it's not of tsvector type
01a52f3
to
01aab5e
Compare
@@ -20,6 +21,7 @@ import PostgREST.SchemaCache.Identifiers (FieldName) | |||
import Protolude | |||
|
|||
type TransformerProc = Text | |||
type ToTsVector = Maybe (Maybe Text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is confusing at first glance, why a double Maybe?
@@ -193,7 +193,7 @@ The :code:`fts` filter mentioned above has a number of options to support flexib | |||
|
|||
curl "http://localhost:3000/tsearch?my_tsv=not.wfts(french).amusant" | |||
|
|||
Using `websearch_to_tsquery` requires PostgreSQL of version at least 11.0 and will raise an error in earlier versions of the database. | |||
This also works with columns of text and JSON types, converting them to ``tsvector`` accordingly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A subsection would be better for this. This way we can also highlight the feature on the release notes.
Maybe "automatic tsvector casting"`, "automatic to_tsvector??
Closes #2255
When an
fts
operator is used, then it appliesto_tsvector
to the filtered column. For example:is equivalent to this in SQL:
If the `totsv` modifier (short for `to_tsvector`) is used for `fts` operators then it applies `to_tsvector` to the filtered column. For example:
Is equivalent to this in SQL: